Previously we have tidy up education statistics acquired from World Bank and have chosen statistics from Barro-Lee. Now we are merging this data set with the original endangered languages list together with obtained elevation from Google.
Showing Unique number to grasp the data source acquisition year, it ranges from 1930s until 2009
library(DT)
set.seed(123)
yr<-gsub(".*\\b([0-9]{4})\\b.*", "\\1", ext.lang.elevation$Sources)
sort(unique(gsub("[^0-9]{4}.*", "", unique(yr))))
## [1] "" "1596" "1930" "1932" "1952" "1959" "1966" "1967" "1976" "1977"
## [11] "1979" "1981" "1982" "1983" "1985" "1986" "1987" "1988" "1989" "1991"
## [21] "1992" "1993" "1994" "1995" "1996" "1997" "1998" "1999" "2000" "2001"
## [31] "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
Barro-Lee world education statistics begins at 1980 and collected every 5 years term. The last date is on 2010, so we first obtained the data and perform unstack, and finally select the collection year (total of 8 years of statistic from 1980 to 2010 with 5 years interval).
edstats.barrolee<-filterByYearUnstack(edstats, 1980, 2016) %>% getColByName (colname="Barro-Lee.*", Country.Code, year)
barrolee.freq<-seq(1980, 2010, by=5)
BarroLee<-which(grepl("Barro-Lee.*", names(edstats.barrolee)))
edstats.barrolee<-bind_rows(sapply(barrolee.freq,
function(x) edstats.barrolee %>% filter(year==as.character(x)), simplify = F))
Not all country code listed in the World Bank education statistics is covered by Barro-Lee, here, we select the country that is listed.
country.code.no.data<-unique(apply(
edstats.barrolee[edstats.barrolee$year=="1980",], 2,
function(x) length(which(is.na(x)))))[1]
a<-which(apply(edstats.barrolee[1:242,], 1, function(x) sum(is.na(x))) > 0) ## Country with no statistics
Below are the list of countries that is in World Bank data set but not in Barro-Lee data.
sapply(edstats.barrolee[a, "Country.Code"],
function(x) edstats[edstats$Country.Code==x, "Country.Name"][1])
## [1] Aruba
## [2] Andorra
## [3] Angola
## [4] Arab World
## [5] American Samoa
## [6] Antigua and Barbuda
## [7] Azerbaijan
## [8] Burkina Faso
## [9] Bahamas, The
## [10] Bosnia and Herzegovina
## [11] Belarus
## [12] Bermuda
## [13] Bhutan
## [14] Channel Islands
## [15] Comoros
## [16] Cabo Verde
## [17] Curacao
## [18] Cayman Islands
## [19] Djibouti
## [20] Dominica
## [21] East Asia & Pacific (excluding high income)
## [22] East Asia & Pacific
## [23] Europe & Central Asia (excluding high income)
## [24] Europe & Central Asia
## [25] Euro area
## [26] Eritrea
## [27] Ethiopia
## [28] European Union
## [29] Faroe Islands
## [30] Micronesia, Fed. Sts.
## [31] Georgia
## [32] Gibraltar
## [33] Guinea
## [34] Guinea-Bissau
## [35] Equatorial Guinea
## [36] Grenada
## [37] Greenland
## [38] Guam
## [39] High income
## [40] Heavily indebted poor countries (HIPC)
## [41] Isle of Man
## [42] Kiribati
## [43] St. Kitts and Nevis
## [44] Kosovo
## [45] Latin America & Caribbean (excluding high income)
## [46] Lebanon
## [47] St. Lucia
## [48] Latin America & Caribbean
## [49] Least developed countries: UN classification
## [50] Low income
## [51] Liechtenstein
## [52] Lower middle income
## [53] Low & middle income
## [54] St. Martin (French part)
## [55] Monaco
## [56] Madagascar
## [57] Middle East & North Africa
## [58] Marshall Islands
## [59] Middle income
## [60] Macedonia, FYR
## [61] Middle East & North Africa (excluding high income)
## [62] Montenegro
## [63] Northern Mariana Islands
## [64] North America
## [65] New Caledonia
## [66] Nigeria
## [67] Nauru
## [68] OECD members
## [69] Oman
## [70] Palau
## [71] Puerto Rico
## [72] Korea, Dem. People’s Rep.
## [73] French Polynesia
## [74] South Asia
## [75] Solomon Islands
## [76] San Marino
## [77] Somalia
## [78] Sub-Saharan Africa (excluding high income)
## [79] South Sudan
## [80] Sub-Saharan Africa
## [81] Sao Tome and Principe
## [82] Suriname
## [83] Sint Maarten (Dutch part)
## [84] Seychelles
## [85] Turks and Caicos Islands
## [86] Chad
## [87] Turkmenistan
## [88] Timor-Leste
## [89] Tuvalu
## [90] Upper middle income
## [91] Uzbekistan
## [92] St. Vincent and the Grenadines
## [93] British Virgin Islands
## [94] Virgin Islands (U.S.)
## [95] Vanuatu
## [96] West Bank and Gaza
## [97] World
## [98] Samoa
## 242 Levels: Afghanistan Albania Algeria American Samoa Andorra ... Zimbabwe
We filter the 98 countries stated above and also took out variables that is stating the country population. The Barro-Lee statistics are in percentage form, so the value already normalize to each country.
edstats.barrolee.l<-edstats.barrolee %>% filter(!(Country.Code %in% edstats.barrolee[a, "Country.Code"]))
b<-grep("Population", names(edstats.barrolee.l)) ## Take out Population statistics
edstats.barrolee.l<- edstats.barrolee.l %>% select(-b)
Finally we find the differences in countries covered in language endangered list and Barro-Lee education statistics, this result in 10.7% data being drop from the language endangered list.
c<-setdiff(ext.lang.elevation$Country.codes.alpha.3.coord, edstats.barrolee.l$Country.Code)
round(sum(sapply(c,
function(x)
length(
ext.lang.elevation[ext.lang.elevation$Country.codes.alpha.3.coord==x,
"Countries"]))) / nrow(ext.lang.elevation), 3) * 100
## [1] 10.7
## 10.7% no data with Barro Lee
We also take into account the speaker population, and filter items in the language endangered list that do not have number of speakers listed, this contribute around 6.7% of data. Total data drop from merging is 18% and retained 2259 rows from original 2722 rows.
We finally do a left merge on the language endangered list with 8 years of Barro-Lee data by country code, arriving with 15,800 observation with 213 variables. The left join will replicate data on the Barro-Lee country education statistic to each endangred language within the country.
# edstats.barrolee.2000<-edstats.barrolee.l %>% filter(year == "2000")
## Take out country from extinction list that not included in Barro-Lee stats 10.7%
## Take out language with no Number of Speakers data 6.7%
new.ext<-ext.lang.elevation %>% filter(!(Country.codes.alpha.3.coord %in% c)) %>%
filter(!is.na(Number.of.speakers)) %>%
select(ID, Degree.of.endangerment, Name.in.English, Number.of.speakers, elevation, Country.codes.alpha.3.coord )
total.speakers<-sum(new.ext$Number.of.speakers)
new.ext<-new.ext %>% mutate(Number.of.speakers.pct = (Number.of.speakers/total.speakers)*100)
## Merge for all years
new.ext<-merge(new.ext, edstats.barrolee.l, by.x="Country.codes.alpha.3.coord",
by.y = "Country.Code", all.x=T) %>% select ( - matches("year") )
showTable(new.ext[1:1000,] %>% select (1, 3:8))
We first try to compress the data and visualized if there’s any pattern that separate the language endangered levels with the variables we collected so far using PCA.
library(RColorBrewer)
BarroLee<-which(grepl("Barro-Lee.*", names(new.ext)))
pca.data<-as.matrix(new.ext %>% select( elevation, Number.of.speakers.pct, BarroLee))
rownames(pca.data)<-as.character(new.ext$Degree.of.endangerment)
Performing PCA and using rank to normalize the data, most of the data from Barro-Lee is stated as percentage which is normalized, the rank here mainly to normalize the elevation and number of speakers data.
pca.barrolee <- princomp(apply(pca.data, 2, rank))
lambda <- (pca.barrolee$sdev)^2
propvar <- round(lambda / sum(lambda) * 100, 0)
Plotting with PC1 with PC2, PC1 with PC3 and PC2 with PC3 to find any pattern in the data, the coloring of points are with the levels of endangered languages.
par(mfrow=c(1,3))
t<-lapply(list(a=c(1,2), b=c(1,3), c=c(2,3)), function(x) {
plotPCA(x[1], x[2], pca.barrolee, brewer.pal(5, "YlOrRd"), titl=F)
title("PCA on Barro-Lee statistics,\nElevation and #Speakers")
})
par(mfrow=c(1,1))
Bigger plot on PC1 vs PC2, there’s apparent pull in 2 end of PC1 scale, but there’s no apparent cluster form in plot.
color<-brewer.pal(5, "YlOrRd")
# color<-c(rep("blue", 2), rep("red", 3))
plotPCA(1,2, pca.barrolee, color, titl=F)
title("PCA on PC1 and PC2 Barro-Lee World Education statistics,\nElevation and #Speakers")
The variables/features that contribute to education statistics may be convoluted and hard to give clear insights on the impact to language endangered risk. Here we use F-statistics to find the highest between group and within group feature that we can choose to represent as education factor.
require(knitr)
f.test <- pca.data[,-c(1:2)]
Fstat <- apply(f.test, 2, function(k) {
y <- stack(k)
result <- summary (aov(values~ind, data=y))
unlist(result[[1]][4])[1] ## F-value
} )
ranked <- order(Fstat, decreasing=TRUE)
kable(data.frame("F-stat"=Fstat[ranked[1:5]], Feature=colnames(f.test)[ranked[1:5]]), row.names = F)
| F.stat | Feature |
|---|---|
| 240.6161 | Barro-Lee: Percentage of female population age 75+ with secondary schooling. Completed Secondary |
| 229.6198 | Barro-Lee: Percentage of female population age 70-74 with secondary schooling. Completed Secondary |
| 225.7698 | Barro-Lee: Percentage of female population age 65-69 with secondary schooling. Completed Secondary |
| 222.1598 | Barro-Lee: Percentage of population age 75+ with secondary schooling. Completed Secondary |
| 218.5045 | Barro-Lee: Percentage of female population age 75+ with tertiary schooling. Total (Incomplete and Completed Tertiary) |
From the result above Percentage of female population age 75+ with secondary schooling. Completed Secondary is number if F-score. Older generation female starting school and have completed secondary level of schooling are contributing the highes in between group variation. In a sociological context, female is care giver for family and culture language continuation depends much on mothers commication with the children to extend the language continual for next generation. And older generation female of population that have completed secondary school showing a break on the use of cultural language within family for at least 1 generation.
We then compile the PCA again using just 3 variables speakers location elevation, number of speakers in percent of total speakers and Percentage of female population age 75+ with secondary schooling. Completed Secondary.
pca.feature.select<-pca.data[,
c(which(colnames(pca.data) %in% c("elevation", "Number.of.speakers.pct",
"Barro-Lee: Percentage of female population age 75+ with secondary schooling. Completed Secondary")))]
pca.barrolee <- princomp(apply(pca.feature.select, 2, rank))
lambda <- (pca.barrolee$sdev)^2
propvar <- round(lambda / sum(lambda) * 100, 0)
To show clustering among classes of language endangerment, we collapse the 5 levels to 3, using Vulnerable and Definately Endangerment with color BLUE (LOW), Severely Endangerment and Critically Endangerment with GREEN (MED) and finally Extinct with RED (HIGH). We tried a few combination of scatter plot and PC1 to PC3,
color<-c(rep(rgb(0,0,1,0.3), 2), rep(rgb(0,1,0,0.3), 2), rgb(1,0,0,0.3))
plotPCA(1,3, pca.barrolee, color)
plotPCA(1,2, pca.barrolee, color)
plotPCA(2,3, pca.barrolee, color)
library(GISTools)
color<-add.alpha(brewer.pal(5, "Spectral"), alpha=0.3)
plotPCA(1,3, pca.barrolee, color)
color<-add.alpha(c(rgb(1,0,0), rgb(0,1,0), rgb(0,0,1)), alpha=0.3)
c<-1
extinc<-which(new.ext$Degree.of.endangerment=="Extinct")
critical<-which(new.ext$Degree.of.endangerment=="Severely endangered" |
new.ext$Degree.of.endangerment=="Critically endangered")
vulnerable<-which(new.ext$Degree.of.endangerment=="Vulnerable" |
new.ext$Degree.of.endangerment=="Definitely endangered")
sets<-list(extinc, critical, vulnerable)
par(mfrow=c(1,3))
x<-c(1,3)
z<-lapply(sets, function(y){
plot(pca.barrolee$scores[y,x[1]], pca.barrolee$scores[y,x[2]], pch=21,
xlab=paste(c("PC",x[1],"(",propvar[x[1]],"%)"), collapse=""),
ylab=paste(c("PC",x[2],"(",propvar[x[2]],"%)"), collapse=""),
col=rgb(0,0,0,0),
xlim=c(-10000, 10000), ylim=c(-10000,10000),
bg=rep(color[c], length(y)))
abline(h=0,v=0, lty=2)
c<<-c+1
})
The loading on PC1 and PC3 showing Number.of.speakers.pct and elevation are contributing positive scaling on PC1 (right side of x-axis) and Percentage of female population age 75+ with secondary schooling. Completed Secondary give negative scale (towards left side of x-axis). PC3 give better polar for the 2 feature of Number.of.speakers.pct and elevation where elevation scale towards positive side (top of y-axis).
kable(data.frame(PC1=pca.barrolee$loadings[,1], PC3=pca.barrolee$loadings[,3]),
caption = "Loadings on the 3 features on PC1 and PC3")
| PC1 | PC3 | |
|---|---|---|
| elevation | 0.5856634 | 0.6156475 |
| Number.of.speakers.pct | 0.6160258 | -0.7608224 |
| Barro-Lee: Percentage of female population age 75+ with secondary schooling. Completed Secondary | -0.5267928 | -0.2052496 |
Reducing features from education statistics we pick a single feature that gives higest between class variation using F-statistics. It shows the female with 75+ generation that completes secondary school increase the risk of not passing on cultural language to the next generation. As they are the main care giver for their children upbringing and receiving education increase their cultural shift and expose to usage of national language. Together with elevation which drive community isolation and the speaker population which drive growth in the language. Higher elevation also shows negative scale towards PC3 as higher elevation also translate to less population and can increase the risk of language growth.
Even though we can see some partition in the 3 classes, there’s still some proportion of the data overlaps when we look at the separate plot, especially for category on vulnerable, definately endangerment with severely endangerment, critical endangerment.
On the last note, we can see a clear partition when we reduce the classification from 5 levels to 3 (even though some propotion of poins overlap), we will continue to explore factors that give more distinct isolation between the classes.
Reading the World Bank education statistics and our first enriched language extinction data set with Elevation and country reference tidy data.
require(reshape2)
require(dplyr)
edstats<-read.csv("../input/Edstats_csv/Edstats_Data.csv")
ext.lang.elevation<-readRDS(file="collected_data/ext.lang.elevation.country.normalized.RDS")
doe.levels<-levels(factor(ext.lang.elevation$Degree.of.endangerment,
levels=c("Vulnerable", "Definitely endangered",
"Severely endangered", "Critically endangered", "Extinct")))
Reusing function from World Education Statistic for unstack the data from world bank
filterByYearUnstack<-function(edstats.df, fromYear=1976, toYear=2016) {
getYear<-function(yr) {
as.numeric(gsub("X([0-9]{4})", "\\1", yr))
}
edstats.years<-data.frame()
for(n in names(edstats)) {
if (grepl("X[0-9]{4}", n) && getYear(n) >=fromYear && getYear(n) <= toYear)
edstats.years<-rbind(edstats.years, edstats.df %>%
dcast(Country.Code~Indicator.Name, value.var= n) %>%
mutate(year=gsub("X([0-9]{4})", "\\1", n)))
}
edstats.years
}
getColByName<-function(.data, colname, ... ) select(.data, which(grepl(colname, names(.data))), ...)
showTable<-function(data) {
datatable(data, options = list(pageLength = 10))
}
Function to plot 2 components of PCA
plotPCA<-function(f, s, pca.barrolee, color, titl=TRUE) {
plot(pca.barrolee$scores[,f], pca.barrolee$scores[,s], pch=21,
xlab=paste(c("PC",f,"(",propvar[s],"%)"), collapse=""),
ylab=paste(c("PC",s,"(",propvar[s],"%)"), collapse=""),
col=rgb(0,0,0,0), bg=color[unclass(factor(new.ext$Degree.of.endangerment,
levels=c("Vulnerable", "Definitely endangered", "Severely endangered",
"Critically endangered", "Extinct")))])
legend("bottomright", doe.levels, col=color,
cex=0.7,
pch=rep(19, length(doe.levels)))
if (titl)
title(paste("PCA PC", f," and PC", s, sep=""))
}